Transformer Combining Vision And Language